
Bloom Filter for Web Scraping Deduplication: Principle, Python, and Redis

**Bloom Filters Explained: Space-Efficient Deduplication for Big Data** Discover how Bloom Filters use bit arrays and hash functions to run existence checks on massive datasets with minimal memory (about 122 KB per 1M items). Learn Python and Redis implementations for web crawling, cache systems, and distributed apps, including code examples for false-positive-tolerant existence checks.

2025-10-04

A Bloom Filter for web scraping deduplication solves a common problem in crawler systems: fast existence checks across massive datasets. It is especially useful when you can tolerate a small false-positive rate, such as in cache-penetration protection, URL deduplication, and large-scale “seen-before” checks.

To build intuition quickly, start with a classic “minimum mice” thought experiment. Then we will connect that logic to Bloom Filters and finish with practical deduplication implementations in Python and Redis.


What problem does a Bloom Filter solve?

In large-scale scraping, you often need to answer one question quickly: has this URL been crawled before?

A traditional set or database index works, but it consumes more memory and may slow down under heavy concurrency. In contrast, a Bloom Filter uses a compact bit array and hash functions. As a result, it can handle large-scale existence checks with very low memory cost.

This is why a Bloom Filter for web scraping deduplication becomes a popular choice in distributed crawler systems.


The 7 mice problem and the Bloom Filter intuition

Imagine 100 bottles of liquid. One bottle contains poison, and a single drop kills a mouse within 24 hours. You want to identify the poisoned bottle in one day, using the fewest mice.

Instead of using 100 mice, you can solve it with only 7 mice, because 7 bits can encode 2^7 = 128 distinct values, which is more than enough to label all 100 bottles.

Step 1: Encode bottles in binary (7 bits)

Number the bottles 1–100 and write each number as a 7-bit binary code (for example, bottle 45 is 0101101).

Step 2: Feed based on bit positions

Assign Mouse 1–Mouse 7 to the 7 bit positions, with Mouse 1 taking the most significant bit. Then each mouse drinks one drop from every bottle whose binary code has a 1 in that mouse's position.

Step 3: Decode deaths after 24 hours

For example, if Mice 2, 4, 5, and 7 die, the binary result (one bit per mouse, dead = 1) is:

0101101 → decimal 45 → Bottle 45 is poisoned.
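
To make the encoding and decoding concrete, here is a small Python sketch of the same logic; the helper names (mice_for_bottle, bottle_from_dead_mice) are purely illustrative:

def mice_for_bottle(bottle: int, bits: int = 7) -> list[int]:
    # Which mice (1 = most significant bit) should drink from this bottle?
    code = format(bottle, f"0{bits}b")
    return [i + 1 for i, b in enumerate(code) if b == "1"]

def bottle_from_dead_mice(dead: list[int], bits: int = 7) -> int:
    # Decode the poisoned bottle number from the set of mice that died.
    return int("".join("1" if i + 1 in dead else "0" for i in range(bits)), 2)

print(mice_for_bottle(45))                   # [2, 4, 5, 7]
print(bottle_from_dead_mice([2, 4, 5, 7]))   # 45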

This demonstrates the core idea: you can represent a large search space using compact bits. Likewise, a Bloom Filter stores membership hints inside a bit array.


How Bloom Filters work

A Bloom Filter has two main parts:

  1. A bit array (each position is 0 or 1)
  2. Multiple hash functions (or a hash function with multiple seeds)

Add an element

  1. Compute k hash values
  2. Map them into bit array positions
  3. Set those bits to 1

Check existence

  1. Compute the same k hashes
  2. If any bit is 0, the element definitely does not exist
  3. If all bits are 1, the element probably exists

Key conclusion: a Bloom Filter can report false positives, but never false negatives. If it says an element is absent, that element was definitely never added.

That false-positive possibility is the trade-off for speed and memory savings in a Bloom Filter for web scraping deduplication.
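
How large that false-positive rate is depends on the bit-array size m, the number of stored items n, and the number of hash functions k. The snippet below evaluates the standard approximation p ≈ (1 − e^(−kn/m))^k; the helper function is only illustrative:

import math

def false_positive_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    # Standard approximation for a Bloom Filter's false-positive probability.
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Example: 1 million items in a 10-million-bit array with 7 hash functions
print(false_positive_rate(1_000_000, 10_000_000, 7))  # ~0.008 (under 1%)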


Pros and cons of Bloom Filters

Pros

  1. Very low memory footprint compared to storing every element
  2. Fast, constant-time adds and lookups (k hash computations each)
  3. The original data never needs to be stored

Cons

  1. False positives are possible
  2. A standard Bloom Filter cannot delete elements (clearing a bit could affect other items)
  3. Stored elements cannot be listed or retrieved


Python Bloom Filter implementation (bitarray + mmh3)

Install dependencies:

pip install bitarray mmh3

Implementation:

from bitarray import bitarray
import mmh3

class BloomFilter:
    def __init__(self, size: int, hash_count: int):
        # size: number of bits in the array; hash_count: number of hash seeds (k)
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def add(self, item: str) -> "BloomFilter":
        # Hash the item k times with different seeds and set each resulting bit to 1
        for i in range(self.hash_count):
            index = mmh3.hash(item, i) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item: str) -> bool:
        # The item "probably exists" only if every corresponding bit is set
        return all(
            self.bit_array[mmh3.hash(item, i) % self.size]
            for i in range(self.hash_count)
        )

# Usage
bloom = BloomFilter(size=100, hash_count=10)
urls = ["www.baidu.com", "www.qq.com", "www.jd.com"]

for url in urls:
    bloom.add(url)

print("www.baidu.com" in bloom)  # True
print("www.fake.com" in bloom)   # False (or rarely True)

Tip: increase size and tune hash_count as you store more URLs; a larger bit array with an appropriate number of hash functions keeps the false-positive rate low.
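
If you prefer to derive size and hash_count from a target false-positive rate, the standard sizing formulas can be applied directly. A minimal sketch, with bloom_parameters as an illustrative helper (not part of bitarray or mmh3):

import math

def bloom_parameters(n_items: int, fp_rate: float) -> tuple[int, int]:
    # m = -n * ln(p) / (ln 2)^2 bits;  k = (m / n) * ln 2 hash functions
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# Example: 1 million URLs at a ~1% false-positive target
m, k = bloom_parameters(1_000_000, 0.01)
print(m, k)  # about 9.6 million bits (~1.2 MB) and 7 hash functions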


Redis integration for distributed deduplication

A single-machine Bloom Filter works well, but distributed crawlers need shared state. A Redis-backed Bloom Filter is therefore a common upgrade path: the RedisBloom module (bundled with Redis Stack) provides BF.* commands that every crawler node can share.

Install client:

pip install redis

Example (RedisBloom commands):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Create the Bloom filter: error_rate=0.01, capacity=100000
# (BF.RESERVE fails if the key already exists, so run it once during setup)
r.execute_command("BF.RESERVE", "myBloomFilter", 0.01, 100000)

# Add and check
items = ["url1", "url2"]
for item in items:
    r.execute_command("BF.ADD", "myBloomFilter", item)

print(r.execute_command("BF.EXISTS", "myBloomFilter", "url1"))  # 1
print(r.execute_command("BF.EXISTS", "myBloomFilter", "urlX"))  # 0 (usually)

In a real distributed workflow, replace localhost with your shared Redis host. Then every crawler node uses the same Bloom Filter state, so deduplication stays consistent.
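
In practice, a crawler node can combine the add and the check into a single round trip, because BF.ADD returns 1 when the item was newly added and 0 when it probably already exists. The host name, filter key, and should_crawl helper below are placeholders for illustration:

import redis

# Shared Redis host for all crawler nodes (placeholder address)
r = redis.Redis(host="redis.shared.internal", port=6379, db=0)

def should_crawl(url: str) -> bool:
    # BF.ADD returns 1 if the URL was newly added, 0 if it probably already exists.
    # If the filter does not exist yet, RedisBloom creates it with default parameters,
    # so reserve it with BF.RESERVE during setup to control error rate and capacity.
    return r.execute_command("BF.ADD", "crawler:seen_urls", url) == 1

for url in ["https://example.com/a", "https://example.com/a"]:
    if should_crawl(url):
        print("crawl:", url)
    else:
        print("skip (probably seen):", url)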


Conclusion

A Bloom Filter for web scraping deduplication provides an efficient trade-off: it dramatically reduces memory usage and keeps existence checks fast, while accepting a small false-positive rate. With Python, you can implement it locally in minutes. With Redis, you can scale the same idea across distributed crawler clusters.